{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "better-nlp-spacy-texacy-examples.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.3"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/neomatrix369/awesome-ai-ml-dl/blob/master/examples/better-nlp/notebooks/google-colab/better_nlp_spacy_texacy_examples.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iFyUOrS1fczL"
      },
      "source": [
        "# Better NLP\n",
        "\n",
        "This is a wrapper program/library that encapsulates a couple of NLP libraries that are popular among the AI and ML communities.\n",
        "\n",
        "Examples have been used to illustrate the usage as much as possible. Not all the APIs of the underlying libraries have been covered.\n",
        "\n",
        "The idea is to keep the API language as high-level as possible, so its easier to use and stays human-readable.\n",
        "\n",
        "Libraries / frameworks covered:\n",
        "\n",
        "- SpaCy ([site](https://spacy.io/) | [docs](https://spacy.io/usage/))\n",
        "- Textacy ([github](https://github.com/chartbeat-labs/textacy) | [docs](https://chartbeat-labs.github.io/textacy/))\n",
        "\n",
        "See [https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/examples/better-nlp](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/examples/better-nlp) for more details.\n",
        "\n",
        "### This notebook will demonstrate the below NLP features / functionalities, using the above mentioned libraries\n",
        "\n",
        "- Extract entities\n",
        "- Noun extraction\n",
        "- Gather facts\n",
        "- Obfuscate privacy details\n",
        "- Parts-of-speech"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Lre8GErufczN"
      },
      "source": [
        "#### Setup and installation ( optional )\n",
        "\n",
        "In case, this notebook is running in a local environment (Linux/MacOS) or _Google Colab_ environment and in case it does not have the necessary dependencies installed then please execute the steps in the next section.\n",
        "\n",
        "Otherwise, please SKIP to the **Install Spacy model ( NOT optional )** section."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "QJuCUOMOfczO",
        "outputId": "018e9102-de25-4fbe-b5b3-919c58644f00",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 5927
        }
      },
      "source": [
        "%%time\n",
        "%%bash\n",
        "\n",
        "apt-get install apt-utils dselect dpkg\n",
        "\n",
        "echo \"OSTYPE=$OSTYPE\"\n",
        "if [[ \"$OSTYPE\" == \"cygwin\" ]] || [[ \"$OSTYPE\" == \"msys\" ]] ; then\n",
        "    echo \"Windows or Windows-like environment detected, script not tested, and may not work.\"\n",
        "    echo \"Try installing the components mention in the install-[ostype].sh scripts manually.\"\n",
        "    echo \"Or try running under CGYWIN or git-bash.\"\n",
        "    echo \"If successfully installed, please contribute back with the solution via a pull request, to https://github.com/neomatrix369/awesome-ai-ml-dl/\"\n",
        "    echo \"Please give the file a good name, i.e. install-windows.sh or install-windows.bat depending on what kind of script you end up writing\"\n",
        "    exit 0\n",
        "elif [[ \"$OSTYPE\" == \"linux-gnu\" ]] || [[ \"$OSTYPE\" == \"linux\" ]]; then\n",
        "    TARGET_OS=\"linux\"\n",
        "else\n",
        "    TARGET_OS=\"macos\"\n",
        "fi\n",
        "\n",
        "if [[ -e ../../library/org/neomatrix369 ]]; then\n",
        "  echo \"Library source found\"\n",
        "  \n",
        "  cd ../../build\n",
        "  \n",
        "  echo \"Detected OS: ${TARGET_OS}\"\n",
        "  ./install-${TARGET_OS}.sh || true\n",
        "else\n",
        "  if [[ -e awesome-ai-ml-dl/examples/better-nlp/library ]]; then\n",
        "     echo \"Library source found\"\n",
        "  else\n",
        "     git clone \"https://github.com/neomatrix369/awesome-ai-ml-dl\"\n",
        "  fi\n",
        "\n",
        "  echo \"Library source exists\"\n",
        "  cd awesome-ai-ml-dl/examples/better-nlp/build\n",
        "\n",
        "  echo \"Detected OS: ${TARGET_OS}\"\n",
        "  ./install-${TARGET_OS}.sh || true \n",
        "fi"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Reading package lists...\n",
            "Building dependency tree...\n",
            "Reading state information...\n",
            "apt-utils is already the newest version (1.4.9).\n",
            "dpkg is already the newest version (1.18.25).\n",
            "dselect is already the newest version (1.18.25).\n",
            "0 upgraded, 0 newly installed, 0 to remove and 28 not upgraded.\n",
            "OSTYPE=linux-gnu\n",
            "Library source found\n",
            "Detected OS: linux\n",
            "Please check if you fulfill the requirements mentioned in the README file.\n",
            "Reading package lists...\n",
            "Reading package lists...\n",
            "Building dependency tree...\n",
            "Reading state information...\n",
            "vim is already the newest version (2:8.0.0197-4+deb9u1).\n",
            "zip is already the newest version (3.0-11+b1).\n",
            "0 upgraded, 0 newly installed, 0 to remove and 28 not upgraded.\n",
            "\n",
            "## Installing the NodeSource Node.js 8.x LTS Carbon repo...\n",
            "\n",
            "\n",
            "## Populating apt-get cache...\n",
            "\n",
            "+ apt-get update\n",
            "Hit:1 http://security.debian.org/debian-security stretch/updates InRelease\n",
            "Ign:2 http://deb.debian.org/debian stretch InRelease\n",
            "Hit:3 http://deb.debian.org/debian stretch-updates InRelease\n",
            "Hit:4 http://deb.debian.org/debian stretch Release\n",
            "Hit:5 https://deb.nodesource.com/node_8.x stretch InRelease\n",
            "Reading package lists...\n",
            "\n",
            "## Confirming \"stretch\" is supported...\n",
            "\n",
            "+ curl -sLf -o /dev/null 'https://deb.nodesource.com/node_8.x/dists/stretch/Release'\n",
            "\n",
            "## Adding the NodeSource signing key to your keyring...\n",
            "\n",
            "+ curl -s https://deb.nodesource.com/gpgkey/nodesource.gpg.key | apt-key add -\n",
            "OK\n",
            "\n",
            "## Creating apt sources list file for the NodeSource Node.js 8.x LTS Carbon repo...\n",
            "\n",
            "+ echo 'deb https://deb.nodesource.com/node_8.x stretch main' > /etc/apt/sources.list.d/nodesource.list\n",
            "+ echo 'deb-src https://deb.nodesource.com/node_8.x stretch main' >> /etc/apt/sources.list.d/nodesource.list\n",
            "\n",
            "## Running `apt-get update` for you...\n",
            "\n",
            "+ apt-get update\n",
            "Hit:1 http://security.debian.org/debian-security stretch/updates InRelease\n",
            "Ign:2 http://deb.debian.org/debian stretch InRelease\n",
            "Hit:3 http://deb.debian.org/debian stretch-updates InRelease\n",
            "Hit:4 http://deb.debian.org/debian stretch Release\n",
            "Hit:5 https://deb.nodesource.com/node_8.x stretch InRelease\n",
            "Reading package lists...\n",
            "\n",
            "## Run `sudo apt-get install -y nodejs` to install Node.js 8.x LTS Carbon and npm\n",
            "## You may also need development tools to build native addons:\n",
            "     sudo apt-get install gcc g++ make\n",
            "## To install the Yarn package manager, run:\n",
            "     curl -sL https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -\n",
            "     echo \"deb https://dl.yarnpkg.com/debian/ stable main\" | sudo tee /etc/apt/sources.list.d/yarn.list\n",
            "     sudo apt-get update && sudo apt-get install yarn\n",
            "\n",
            "\n",
            "Reading package lists...\n",
            "Building dependency tree...\n",
            "Reading state information...\n",
            "nodejs is already the newest version (8.16.0-1nodesource1).\n",
            "0 upgraded, 0 newly installed, 0 to remove and 28 not upgraded.\n",
            "Install components using python and pip\n",
            "/usr/bin/npm -> /usr/lib/node_modules/npm/bin/npm-cli.js\n",
            "/usr/bin/npx -> /usr/lib/node_modules/npm/bin/npx-cli.js\n",
            "+ npm@6.9.0\n",
            "updated 1 package in 13.258s\n",
            "6.9.0\n",
            "Python 3.7.3\n",
            "pip 19.0.3 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)\n",
            "Requirement already satisfied: spacy in /usr/local/lib/python3.7/site-packages (2.1.4)\n",
            "Requirement already satisfied: textacy in /usr/local/lib/python3.7/site-packages (0.7.0)\n",
            "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/site-packages (from spacy) (1.0.2)\n",
            "Requirement already satisfied: plac<1.0.0,>=0.9.6 in /usr/local/lib/python3.7/site-packages (from spacy) (0.9.6)\n",
            "Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/site-packages (from spacy) (1.16.3)\n",
            "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/site-packages (from spacy) (2.21.0)\n",
            "Requirement already satisfied: wasabi<1.1.0,>=0.2.0 in /usr/local/lib/python3.7/site-packages (from spacy) (0.2.2)\n",
            "Requirement already satisfied: srsly<1.1.0,>=0.0.5 in /usr/local/lib/python3.7/site-packages (from spacy) (0.0.5)\n",
            "Requirement already satisfied: jsonschema<3.1.0,>=2.6.0 in /usr/local/lib/python3.7/site-packages (from spacy) (3.0.1)\n",
            "Requirement already satisfied: preshed<2.1.0,>=2.0.1 in /usr/local/lib/python3.7/site-packages (from spacy) (2.0.1)\n",
            "Requirement already satisfied: blis<0.3.0,>=0.2.2 in /usr/local/lib/python3.7/site-packages (from spacy) (0.2.4)\n",
            "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/site-packages (from spacy) (2.0.2)\n",
            "Requirement already satisfied: thinc<7.1.0,>=7.0.2 in /usr/local/lib/python3.7/site-packages (from spacy) (7.0.4)\n",
            "Requirement already satisfied: cachetools>=2.0.0 in /usr/local/lib/python3.7/site-packages (from textacy) (3.1.0)\n",
            "Requirement already satisfied: pyemd>=0.3.0 in /usr/local/lib/python3.7/site-packages (from textacy) (0.5.1)\n",
            "Requirement already satisfied: networkx>=1.11 in /usr/local/lib/python3.7/site-packages (from textacy) (2.3)\n",
            "Requirement already satisfied: pyphen>=0.9.4 in /usr/local/lib/python3.7/site-packages (from textacy) (0.9.5)\n",
            "Requirement already satisfied: joblib>=0.13.0 in /usr/local/lib/python3.7/site-packages (from textacy) (0.13.2)\n",
            "Requirement already satisfied: cytoolz>=0.8.0 in /usr/local/lib/python3.7/site-packages (from textacy) (0.9.0.1)\n",
            "Requirement already satisfied: python-levenshtein>=0.12.0 in /usr/local/lib/python3.7/site-packages (from textacy) (0.12.0)\n",
            "Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.7/site-packages (from textacy) (1.2.1)\n",
            "Requirement already satisfied: scikit-learn<0.21.0,>=0.17.0 in /usr/local/lib/python3.7/site-packages (from textacy) (0.20.3)\n",
            "Requirement already satisfied: tqdm>=4.11.1 in /usr/local/lib/python3.7/site-packages (from textacy) (4.32.1)\n",
            "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)\n",
            "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.3)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy) (2019.3.9)\n",
            "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.8)\n",
            "Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from jsonschema<3.1.0,>=2.6.0->spacy) (40.8.0)\n",
            "Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/site-packages (from jsonschema<3.1.0,>=2.6.0->spacy) (19.1.0)\n",
            "Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib/python3.7/site-packages (from jsonschema<3.1.0,>=2.6.0->spacy) (0.15.2)\n",
            "Requirement already satisfied: six>=1.11.0 in /usr/local/lib/python3.7/site-packages (from jsonschema<3.1.0,>=2.6.0->spacy) (1.12.0)\n",
            "Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.7/site-packages (from networkx>=1.11->textacy) (4.4.0)\n",
            "Requirement already satisfied: toolz>=0.8.0 in /usr/local/lib/python3.7/site-packages (from cytoolz>=0.8.0->textacy) (0.9.0)\n",
            "Requirement already satisfied: jupyterlab in /root/.local/lib/python3.7/site-packages (0.35.6)\n",
            "Requirement already satisfied: pandas in /root/.local/lib/python3.7/site-packages (0.24.2)\n",
            "Requirement already satisfied: nltk in /root/.local/lib/python3.7/site-packages (3.4.1)\n",
            "Requirement already satisfied: pytextrank in /root/.local/lib/python3.7/site-packages (1.1.0)\n",
            "Requirement already satisfied: matplotlib in /root/.local/lib/python3.7/site-packages (3.0.3)\n",
            "Requirement already satisfied: notebook>=4.3.1 in /root/.local/lib/python3.7/site-packages (from jupyterlab) (5.7.8)\n",
            "Requirement already satisfied: jupyterlab-server<0.3.0,>=0.2.0 in /root/.local/lib/python3.7/site-packages (from jupyterlab) (0.2.0)\n",
            "Requirement already satisfied: python-dateutil>=2.5.0 in /root/.local/lib/python3.7/site-packages (from pandas) (2.8.0)\n",
            "Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.7/site-packages (from pandas) (1.16.3)\n",
            "Requirement already satisfied: pytz>=2011k in /root/.local/lib/python3.7/site-packages (from pandas) (2019.1)\n",
            "Requirement already satisfied: six in /usr/local/lib/python3.7/site-packages (from nltk) (1.12.0)\n",
            "Requirement already satisfied: statistics in /root/.local/lib/python3.7/site-packages (from pytextrank) (1.0.3.5)\n",
            "Requirement already satisfied: spacy in /usr/local/lib/python3.7/site-packages (from pytextrank) (2.1.4)\n",
            "Requirement already satisfied: graphviz in /root/.local/lib/python3.7/site-packages (from pytextrank) (0.10.1)\n",
            "Requirement already satisfied: datasketch in /root/.local/lib/python3.7/site-packages (from pytextrank) (1.4.3)\n",
            "Requirement already satisfied: networkx in /usr/local/lib/python3.7/site-packages (from pytextrank) (2.3)\n",
            "Requirement already satisfied: kiwisolver>=1.0.1 in /root/.local/lib/python3.7/site-packages (from matplotlib) (1.1.0)\n",
            "Requirement already satisfied: cycler>=0.10 in /root/.local/lib/python3.7/site-packages (from matplotlib) (0.10.0)\n",
            "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /root/.local/lib/python3.7/site-packages (from matplotlib) (2.4.0)\n",
            "Requirement already satisfied: Send2Trash in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (1.5.0)\n",
            "Requirement already satisfied: tornado<7,>=4.1 in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (6.0.2)\n",
            "Requirement already satisfied: ipython-genutils in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (0.2.0)\n",
            "Requirement already satisfied: ipykernel in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (5.1.0)\n",
            "Requirement already satisfied: prometheus-client in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (0.6.0)\n",
            "Requirement already satisfied: pyzmq>=17 in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (18.0.1)\n",
            "Requirement already satisfied: nbformat in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (4.4.0)\n",
            "Requirement already satisfied: nbconvert in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (5.5.0)\n",
            "Requirement already satisfied: jupyter-client>=5.2.0 in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (5.2.4)\n",
            "Requirement already satisfied: traitlets>=4.2.1 in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (4.3.2)\n",
            "Requirement already satisfied: jinja2 in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (2.10.1)\n",
            "Requirement already satisfied: jupyter-core>=4.4.0 in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (4.4.0)\n",
            "Requirement already satisfied: terminado>=0.8.1 in /root/.local/lib/python3.7/site-packages (from notebook>=4.3.1->jupyterlab) (0.8.2)\n",
            "Requirement already satisfied: jsonschema>=2.6.0 in /usr/local/lib/python3.7/site-packages (from jupyterlab-server<0.3.0,>=0.2.0->jupyterlab) (3.0.1)\n",
            "Requirement already satisfied: docutils>=0.3 in /root/.local/lib/python3.7/site-packages (from statistics->pytextrank) (0.14)\n",
            "Requirement already satisfied: thinc<7.1.0,>=7.0.2 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (7.0.4)\n",
            "Requirement already satisfied: blis<0.3.0,>=0.2.2 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (0.2.4)\n",
            "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (2.0.2)\n",
            "Requirement already satisfied: plac<1.0.0,>=0.9.6 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (0.9.6)\n",
            "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (2.21.0)\n",
            "Requirement already satisfied: preshed<2.1.0,>=2.0.1 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (2.0.1)\n",
            "Requirement already satisfied: wasabi<1.1.0,>=0.2.0 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (0.2.2)\n",
            "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (1.0.2)\n",
            "Requirement already satisfied: srsly<1.1.0,>=0.0.5 in /usr/local/lib/python3.7/site-packages (from spacy->pytextrank) (0.0.5)\n",
            "Requirement already satisfied: redis>=2.10.0 in /root/.local/lib/python3.7/site-packages (from datasketch->pytextrank) (3.2.1)\n",
            "Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.7/site-packages (from networkx->pytextrank) (4.4.0)\n",
            "Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib) (40.8.0)\n",
            "Requirement already satisfied: ipython>=5.0.0 in /root/.local/lib/python3.7/site-packages (from ipykernel->notebook>=4.3.1->jupyterlab) (7.5.0)\n",
            "Requirement already satisfied: testpath in /root/.local/lib/python3.7/site-packages (from nbconvert->notebook>=4.3.1->jupyterlab) (0.4.2)\n",
            "Requirement already satisfied: entrypoints>=0.2.2 in /root/.local/lib/python3.7/site-packages (from nbconvert->notebook>=4.3.1->jupyterlab) (0.3)\n",
            "Requirement already satisfied: defusedxml in /root/.local/lib/python3.7/site-packages (from nbconvert->notebook>=4.3.1->jupyterlab) (0.6.0)\n",
            "Requirement already satisfied: pygments in /root/.local/lib/python3.7/site-packages (from nbconvert->notebook>=4.3.1->jupyterlab) (2.4.0)\n",
            "Requirement already satisfied: bleach in /root/.local/lib/python3.7/site-packages (from nbconvert->notebook>=4.3.1->jupyterlab) (3.1.0)\n",
            "Requirement already satisfied: pandocfilters>=1.4.1 in /root/.local/lib/python3.7/site-packages (from nbconvert->notebook>=4.3.1->jupyterlab) (1.4.2)\n",
            "Requirement already satisfied: mistune>=0.8.1 in /root/.local/lib/python3.7/site-packages (from nbconvert->notebook>=4.3.1->jupyterlab) (0.8.4)\n",
            "Requirement already satisfied: MarkupSafe>=0.23 in /root/.local/lib/python3.7/site-packages (from jinja2->notebook>=4.3.1->jupyterlab) (1.1.1)\n",
            "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /root/.local/lib/python3.7/site-packages (from terminado>=0.8.1->notebook>=4.3.1->jupyterlab) (0.6.0)\n",
            "Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/site-packages (from jsonschema>=2.6.0->jupyterlab-server<0.3.0,>=0.2.0->jupyterlab) (19.1.0)\n",
            "Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib/python3.7/site-packages (from jsonschema>=2.6.0->jupyterlab-server<0.3.0,>=0.2.0->jupyterlab) (0.15.2)\n",
            "Requirement already satisfied: tqdm<5.0.0,>=4.10.0 in /usr/local/lib/python3.7/site-packages (from thinc<7.1.0,>=7.0.2->spacy->pytextrank) (4.32.1)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pytextrank) (2019.3.9)\n",
            "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pytextrank) (3.0.4)\n",
            "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pytextrank) (1.24.3)\n",
            "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy->pytextrank) (2.8)\n",
            "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /root/.local/lib/python3.7/site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.3.1->jupyterlab) (4.7.0)\n",
            "Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /root/.local/lib/python3.7/site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.3.1->jupyterlab) (2.0.9)\n",
            "Requirement already satisfied: backcall in /root/.local/lib/python3.7/site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.3.1->jupyterlab) (0.1.0)\n",
            "Requirement already satisfied: jedi>=0.10 in /root/.local/lib/python3.7/site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.3.1->jupyterlab) (0.13.3)\n",
            "Requirement already satisfied: pickleshare in /root/.local/lib/python3.7/site-packages (from ipython>=5.0.0->ipykernel->notebook>=4.3.1->jupyterlab) (0.7.5)\n",
            "Requirement already satisfied: webencodings in /root/.local/lib/python3.7/site-packages (from bleach->nbconvert->notebook>=4.3.1->jupyterlab) (0.5.1)\n",
            "Requirement already satisfied: wcwidth in /root/.local/lib/python3.7/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython>=5.0.0->ipykernel->notebook>=4.3.1->jupyterlab) (0.1.7)\n",
            "Requirement already satisfied: parso>=0.3.0 in /root/.local/lib/python3.7/site-packages (from jedi>=0.10->ipython>=5.0.0->ipykernel->notebook>=4.3.1->jupyterlab) (0.4.0)\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)\n",
            "E: Unable to lock directory /var/lib/apt/lists/\n",
            "Warning: apt-key output should not be parsed (stdout is not a terminal)\n",
            "You are using pip version 19.0.3, however version 19.1.1 is available.\n",
            "You should consider upgrading via the 'pip install --upgrade pip' command.\n",
            "You are using pip version 19.0.3, however version 19.1.1 is available.\n",
            "You should consider upgrading via the 'pip install --upgrade pip' command.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "CPU times: user 10 ms, sys: 10 ms, total: 20 ms\n",
            "Wall time: 31 s\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "RtvtZg2afczS"
      },
      "source": [
        "#### Install Spacy model ( NOT optional )\n",
        "\n",
        "Install the large English language model for spaCy - will be needed for the examples in this notebooks.\n",
        "\n",
        "**Note:** from observation it appears that spaCy model should be installed towards the end of the installation process, it avoid errors when running programs using the model."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "HLvwAn-DfczT",
        "outputId": "77fe7daf-6186-4b97-d248-82807d92260e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 279
        }
      },
      "source": [
        "%%time\n",
        "%%bash\n",
        "\n",
        "python -m spacy download en_core_web_lg\n",
        "python -m spacy link en_core_web_lg en || true"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: en_core_web_lg==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.1.0/en_core_web_lg-2.1.0.tar.gz#egg=en_core_web_lg==2.1.0 in /usr/local/lib/python3.7/site-packages (2.1.0)\n",
            "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
            "You can now load the model via spacy.load('en_core_web_lg')\n",
            "\n",
            "\u001b[38;5;1m✘ Link 'en' already exists\u001b[0m\n",
            "To overwrite an existing link, use the --force flag\n",
            "\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "You are using pip version 19.0.3, however version 19.1.1 is available.\n",
            "You should consider upgrading via the 'pip install --upgrade pip' command.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "CPU times: user 0 ns, sys: 10 ms, total: 10 ms\n",
            "Wall time: 3.59 s\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "kwXgEdM8oeUv"
      },
      "source": [
        "## Examples"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "AX1pZlKofczb"
      },
      "source": [
        "### Extract entities"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "Pe2IZleafczd",
        "colab": {}
      },
      "source": [
        "import sys\n",
        "sys.path.insert(0, '../../library')\n",
        "sys.path.insert(0, './awesome-ai-ml-dl/examples/better-nlp/library')\n",
        "\n",
        "from org.neomatrix369.better_nlp import BetterNLP"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "cu1TA5ZJfczi",
        "colab": {}
      },
      "source": [
        "# Can be any factual text or any text to experiment with\n",
        "generic_text = \"Denis Guedj (1940 – April 24, 2010) was a French novelist and a professor of the History of Science at Paris VIII University. He was born in Setif. He spent many years devising courses and games to teach adults and children math. He is the author of Numbers: The Universal Language and of the novel The Parrot's Theorem. He died in Paris. \"\n",
        "\n",
        "betterNLP = BetterNLP() ### do not re-run this unless you wish to re-initialise the object"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "iaihh-wofczq",
        "outputId": "7b0dfe83-0ab2-4485-e05b-c2605e3292c8",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 104
        }
      },
      "source": [
        "model_loading_result = betterNLP.load_nlp_model()\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "print(\"model_loading_time_in_secs=\",model_loading_result['model_loading_time_in_secs'])\n",
        "print(\"model_loading_method=\",model_loading_result['model_loading_method'])\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "\n",
        "model = model_loading_result[\"model\"]"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Loading model 'en'...\n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "model_loading_time_in_secs= 15.27812910079956\n",
            "model_loading_method= directly, first time\n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "empjCbipfczu",
        "outputId": "d58b032e-a7d2-41d0-dc07-8e2267f8c552",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 570
        }
      },
      "source": [
        "extracted_entities = betterNLP.extract_entities(model, generic_text)\n",
        "\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "print(\"extract_entities_processing_time_in_secs=\", extracted_entities['extract_entities_processing_time_in_secs'])\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "betterNLP.pretty_print(extracted_entities[\"extracted_entities\"])\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "betterNLP.pretty_print(betterNLP.token_entity_types())"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "extract_entities_processing_time_in_secs= 0.06418943405151367\n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "                     text        label\n",
            "0             Denis Guedj       PERSON\n",
            "1   1940 – April 24, 2010         DATE\n",
            "2                  French         NORP\n",
            "3  the History of Science          ORG\n",
            "4   Paris VIII University          ORG\n",
            "5                   Setif         NORP\n",
            "6              many years         DATE\n",
            "7    The Parrot's Theorem  WORK_OF_ART\n",
            "8                   Paris          GPE\n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "    Entity type                                          Description\n",
            "0        PERSON                          People, including fictional\n",
            "1          NORP       Nationalities or religious or political groups\n",
            "2           FAC          Buildings, airports, highways, bridges, etc\n",
            "3           ORG               Companies, agencies, institutions, etc\n",
            "4           GPE                            Countries, cities, states\n",
            "5           LOC  Non-GPE locations, mountain ranges, bodies of water\n",
            "6       PRODUCT        Objects, vehicles, foods, etc. (Not services.\n",
            "7         EVENT  Named hurricanes, battles, wars, sports events, etc\n",
            "8   WORK_OF_ART                          Titles of books, songs, etc\n",
            "9           LAW                       Named documents made into laws\n",
            "10     LANGUAGE                                   Any named language\n",
            "11         DATE                Absolute or relative dates or periods\n",
            "12         TIME                             Times smaller than a day\n",
            "13      PERCENT                            Percentage, including ”%“\n",
            "14        MONEY                      Monetary values, including unit\n",
            "15     QUANTITY               Measurements, as of weight or distance\n",
            "16      ORDINAL                               “first”, “second”, etc\n",
            "17     CARDINAL         Numerals that do not fall under another type\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "T7A3p05vfcz0"
      },
      "source": [
        "### Noun extraction"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "hetkfphHfcz2",
        "outputId": "0c6d1d9b-3bd7-45ad-b5df-c3c8cb2feceb",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 190
        }
      },
      "source": [
        "chunks = betterNLP.extract_noun_chunks(model, generic_text)\n",
        "chunks = chunks.get(\"noun_chunks\")\n",
        "\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "betterNLP.pretty_print(chunks)\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "  Words belonging together (lowercase)\n",
            "0                          denis guedj\n",
            "1                   universal language\n",
            "2                         1940 – april\n",
            "3                paris viii university\n",
            "4                      french novelist\n",
            "5                           many years\n",
            "6                        children math\n",
            "7                     parrot's theorem\n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "s2kyTI6bfcz7"
      },
      "source": [
        "### Gather facts"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "aKxvH1Ynfcz_",
        "outputId": "3a267644-b8cc-41c6-bdca-f5b89a495db5",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 104
        }
      },
      "source": [
        "target_topic = \"Denis Guedj\"\n",
        "extracted_facts = betterNLP.extract_facts(model, generic_text, target_topic)\n",
        "\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "betterNLP.pretty_print(extracted_facts.get(\"facts\"))\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "                                                                Facts about Denis Guedj\n",
            "0  a French novelist and a professor of the History of Science at Paris VIII University\n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "4jQLb4Sqfc0E"
      },
      "source": [
        "### Obfuscate privacy details"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "_bmI2UcSfc0F",
        "outputId": "049c4553-d542-43f5-8dc3-c479f7c0720d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 89
        }
      },
      "source": [
        "obfuscated_text = betterNLP.obfuscate_text(model, generic_text)\n",
        "obfuscated_text = obfuscated_text.get(\"obfuscated_text\")\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")\n",
        "print(\"Obfuscated generic text: \", \"\".join(obfuscated_text))\n",
        "print(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
            "Obfuscated generic text:  [OBFUSCATED] (1940 – April 24, 2010) was a French novelist and a professor of the History of Science at Paris VIII University. He was born in Setif. He spent many years devising courses and games to teach adults and children math. He is the author of Numbers: The Universal Language and of the novel The Parrot's Theorem. He died in Paris. \n",
            "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "H_JfZhqwfc0J",
        "colab": {}
      },
      "source": [
        "### Parts-of-speech"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2L4gWuIRW7c7",
        "colab_type": "code",
        "colab": {},
        "outputId": "e51b51d8-897b-4456-a489-8fac8d37a924"
      },
      "source": [
        "generic_text = u'Apple is looking at buying U.K. startup for $1 billion'\n",
        "parts_of_speech_tagging = betterNLP.parts_of_speech_tagging(model, generic_text).get(\"parts_of_speech\")\n",
        "betterNLP.pretty_print(parts_of_speech_tagging)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "      token    lemma parts-of-speech  tag syntactic dependency  shape  \\\n",
            "0     Apple    Apple           PROPN  NNP                nsubj  Xxxxx   \n",
            "1        is       be            VERB  VBZ                  aux     xx   \n",
            "2   looking     look            VERB  VBG                 ROOT   xxxx   \n",
            "3        at       at             ADP   IN                 prep     xx   \n",
            "4    buying      buy            VERB  VBG                pcomp   xxxx   \n",
            "5      U.K.     U.K.           PROPN  NNP             compound   X.X.   \n",
            "6   startup  startup            NOUN   NN                 dobj   xxxx   \n",
            "7       for      for             ADP   IN                 prep    xxx   \n",
            "8         $        $             SYM    $             quantmod      $   \n",
            "9         1        1             NUM   CD             compound      d   \n",
            "10  billion  billion             NUM   CD                 pobj   xxxx   \n",
            "\n",
            "    is_alphanumeric  is_stop_word  \n",
            "0              True         False  \n",
            "1              True          True  \n",
            "2              True         False  \n",
            "3              True          True  \n",
            "4              True         False  \n",
            "5             False         False  \n",
            "6              True         False  \n",
            "7              True          True  \n",
            "8             False         False  \n",
            "9             False         False  \n",
            "10             True         False  \n"
          ],
          "name": "stdout"
        }
      ]
    }
  ]
}
